[feat][plugin] make ATOM MLA attention work for vLLM #265

Open

XiaobingSuper wants to merge 13 commits into ROCm:main from XiaobingSuper:xiaobing/oot_kimi

Conversation


@XiaobingSuper XiaobingSuper commented Mar 4, 2026

Motivation

Following #126, this PR makes ATOM MLA attention work in vLLM plugin mode. Note: sparse MLA is not supported yet and will be implemented in the next step.

Technical Details

The design details can be found in #126.

Test Plan

This PR tests the Kimi-K2-Thinking-MXFP4 model with TP=4 on MI355:

```shell
export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_RPC_TIMEOUT=1800000

export VLLM_CACHE_ROOT=/root/.cache/vllm
export TORCHINDUCTOR_CACHE_DIR=/root/.cache/inductor
export HIP_VISIBLE_DEVICES=0,1,2,3
# quick allreduce
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export ATOM_PROFILER_MORE=1

export VLLM_TORCH_PROFILER_RECORD_SHAPES=1

model_path=Kimi-K2-Thinking-MXFP4
vllm serve $model_path \
    --host localhost \
    --port 8001 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --trust-remote-code \
    --disable-log-requests \
    --gpu_memory_utilization 0.9 \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens 18432 \
    --max-model-len 16384 \
    --no-enable-prefix-caching
```

Test Result

gsm8k result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.9371|±  |0.0067|
|     |       |strict-match    |     3|exact_match|↑  |0.9363|±  |0.0067|
  

Submission Checklist

Copilot AI review requested due to automatic review settings March 4, 2026 11:49
Contributor

Copilot AI left a comment
Pull request overview

Adds vLLM plugin-mode support for ATOM’s MLA attention path (non-sparse), including backend selection, metadata plumbing, and DeepSeek V3 model registration/loading so MLA can run end-to-end under vLLM.

Changes:

  • Route vLLM’s use_mla attention selection to an ATOM MLA backend and add MLA-specific plugin-mode metadata builders.
  • Implement plugin-mode MLA forward/prefill/decode logic (including positions capture for graph mode).
  • Register DeepSeek V3 as a supported vLLM plugin model and add a plugin-mode load_weights implementation.
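The backend routing in the first bullet can be sketched roughly as follows. This is a minimal sketch modeled on the description above; the class and backend path names are assumptions, not the actual `atom/plugin/vllm/platform.py` code.

```python
# Sketch of MLA backend selection; names are hypothetical stand-ins,
# not the actual atom/plugin/vllm/platform.py implementation.
from dataclasses import dataclass

@dataclass
class AttnSelectorConfig:
    use_mla: bool

# Hypothetical backend class paths for illustration only.
ATOM_MLA_BACKEND = "atom.model_ops.attentions.aiter_mla.AiterMLABackend"
ATOM_MHA_BACKEND = "atom.model_ops.attentions.aiter_attention.AiterAttentionBackend"

def get_attn_backend_cls(cfg: AttnSelectorConfig) -> str:
    # Route vLLM's attention selection to the ATOM MLA backend when MLA
    # is requested; otherwise fall back to the regular ATOM backend.
    return ATOM_MLA_BACKEND if cfg.use_mla else ATOM_MHA_BACKEND

print(get_attn_backend_cls(AttnSelectorConfig(use_mla=True)))
```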

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.

| File | Description |
|------|-------------|
| atom/utils/backends.py | Extends compilation-cache hashing to ignore `<frozen os>` traced "files". |
| atom/plugin/vllm/register.py | Patches vLLM process_weights_after_loading for Attention/MLAAttention. |
| atom/plugin/vllm/platform.py | Selects ATOM MLA backend when attn_selector_config.use_mla is true. |
| atom/plugin/vllm/model_wrapper.py | Copies positions into a static buffer for graph-mode MLA correctness. |
| atom/plugin/attention_mla.py | New: plugin-mode MLAAttention implementation helpers (prefill/decode/DCP). |
| atom/plugin/attention.py | Adds MLA plugin-mode metadata builders + backend wiring; renames plugin metadata class. |
| atom/models/deepseek_v2.py | Adds DeepSeek V3 support + plugin-mode load_weights. |
| atom/model_ops/utils.py | Removes duplicate per_tensor_dequantize implementation (keeps the canonical one). |
| atom/model_ops/paged_attention.py | Integrates vLLM MLAAttention usage and allocates a shared positions buffer. |
| atom/model_ops/linear.py | Ensures activation tensor is contiguous before quantizer .view() calls. |
| atom/model_ops/base_attention.py | Adjusts MLA unified-attn path to apply o_proj outside MLA impl. |
| atom/model_ops/attentions/aiter_mla.py | Decorates MLA backend/builder for plugin mode; builder init adjustments. |
| atom/model_ops/attentions/aiter_attention.py | Removes unused import. |
| atom/model_ops/attention_mla.py | Adds plugin-mode hooks/decorator and splits v_up and o_proj responsibilities. |


```diff
+# quant_func will call view, so we need to call contiguous to avoid view error
 x, x_scale = quant_func(
-    x,
+    x.contiguous(),
```
Author
This is required for the deepseek-r1 model, where x is a sliced tensor that cannot be viewed.
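The underlying issue can be reproduced with NumPy, used here only to illustrate why a sliced tensor breaks zero-copy reshaping; the actual code deals with torch tensors, where `.view()` raises outright on non-contiguous input:

```python
import numpy as np

x_full = np.arange(12, dtype=np.float32).reshape(3, 4)
x = x_full[:, :2]                  # sliced tensor: row data is no longer adjacent in memory

print(x.flags["C_CONTIGUOUS"])     # False -- a zero-copy view/reshape is impossible here
x_contig = np.ascontiguousarray(x)  # analogous to torch's x.contiguous(); makes a copy
print(x_contig.flags["C_CONTIGUOUS"])  # True
```

The copy is exactly the overhead the reviewer objects to below, which is why the `.contiguous()` call ends up on the plugin side only.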

Collaborator
We will do something else to avoid contiguous, since it introduces a memory copy here; all our quant ops should already support non-contiguous tensors... did we hit an issue here?

Author

@XiaobingSuper XiaobingSuper Mar 6, 2026

Collaborator
Then maybe do contiguous at that place; I don't want to lose any perf.

Author
Yes, I updated the code to do contiguous on the plugin side.

Author

DeepSeek-R1-0528 with TP=8 has also been tested:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.9424|±  |0.0064|
|     |       |strict-match    |     3|exact_match|↑  |0.9363|±  |0.0067|

Copilot AI review requested due to automatic review settings March 4, 2026 13:00
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.



Collaborator

@ChuanLi1101 ChuanLi1101 left a comment

Left my comment FYI.

Copilot AI review requested due to automatic review settings March 5, 2026 05:49
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.



Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.



ChuanLi1101 previously approved these changes Mar 5, 2026

Collaborator

@ChuanLi1101 ChuanLi1101 left a comment

LGTM, thanks for the quick turnaround.

Copilot AI review requested due to automatic review settings March 6, 2026 07:27
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.



ZhangLirong-amd previously approved these changes Mar 6, 2026
```diff
 # dummy run: skip real attention and return
 output_shape = list(q.shape)
-output_shape[-1] = 7168
+output_shape[-1] = self.num_heads * self.v_head_dim
```
Collaborator

self.num_heads * self.v_head_dim looks like it is not equal to 7168 for deepseek.

Author

@XiaobingSuper XiaobingSuper Mar 6, 2026

This is o_proj's input; see the atom path:
[image]
and the plugin path:
[image]

The reason is that vLLM does o_proj outside of the attention backend.
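As a concrete check of the reviewer's point, using DeepSeek-V3's commonly cited config values (hidden_size 7168, 128 attention heads, v_head_dim 128; these values are assumptions, not taken from this PR):

```python
# Illustrative shape check; config values assumed from DeepSeek-V3.
hidden_size = 7168          # model hidden dim, the old hard-coded output width
num_heads = 128             # total attention heads
v_head_dim = 128            # per-head value dim after the v_up projection

o_proj_input_dim = num_heads * v_head_dim   # width of o_proj's INPUT
print(o_proj_input_dim)                     # 16384, not 7168
```

So when o_proj lives outside the attention backend, the attention output width is num_heads * v_head_dim rather than hidden_size, which is why the hard-coded 7168 was wrong.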

Collaborator

The plugin path is also in our repo... then why do we have to move o_proj out of attn?

Author

This change is for the fallback path, i.e., plugin mode but using the vLLM attn backend. Because we use vLLM's MLAAttention class (self.attn here), the forward path doesn't have o_proj; see https://github.com/vllm-project/vllm/blob/v0.15.1/vllm/attention/layer.py#L640, which only does the attention compute.
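A minimal runnable sketch of this split, with NumPy standing in for real tensors and all names hypothetical: the attention impl returns the flattened per-head value outputs, and the caller applies o_proj afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_heads, v_head_dim, hidden = 4, 8, 16, 32

def mla_attention(q):
    # Stand-in for an attention forward that, like vLLM's MLAAttention,
    # returns per-head outputs flattened to [num_tokens, num_heads * v_head_dim]
    # with NO o_proj applied inside.
    return rng.standard_normal((q.shape[0], num_heads * v_head_dim))

w_o = rng.standard_normal((num_heads * v_head_dim, hidden))  # o_proj weight

def forward(q):
    attn_out = mla_attention(q)   # [num_tokens, num_heads * v_head_dim]
    return attn_out @ w_o         # caller applies o_proj -> [num_tokens, hidden]

out = forward(np.zeros((num_tokens, 1)))
print(out.shape)                  # (4, 32)
```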


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.



Copilot AI review requested due to automatic review settings March 6, 2026 15:42
```diff
     **kwargs,
 )
 
+impl_args["head_size" if self.use_mla else "head_dim"] = head_dim
```
Collaborator

This also comes from vllm? I would like us to always use head_dim.

Author

@XiaobingSuper XiaobingSuper Mar 6, 2026

Yes, vLLM uses head_size; see https://github.com/vllm-project/vllm/blob/v0.15.1/vllm/attention/layer.py#L579. Before this PR, 6c40248 also used head_size.
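The naming split under discussion can be sketched like this (a hypothetical helper; the real impl classes are stand-ins, and only the kwarg name differs between the two paths):

```python
def build_impl_args(head_dim, use_mla, **extra):
    # vLLM's MLA impl expects `head_size`, while the MHA impl expects
    # `head_dim` -- same value, different kwarg name.
    impl_args = dict(extra)
    impl_args["head_size" if use_mla else "head_dim"] = head_dim
    return impl_args

print(build_impl_args(128, True))    # {'head_size': 128}
print(build_impl_args(128, False))   # {'head_dim': 128}
```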

```diff
     self.layer_num = layer_num
 
-    def process_weights_after_loading(self):
+    def process_weights_after_loading(self, act_dtype: Optional[torch.dtype] = None):
```
Collaborator

@zejunchen-zejun do we need to add this arg?

Author

@XiaobingSuper XiaobingSuper Mar 6, 2026

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.



Comment on lines 190 to +208

```diff
@@ -153,7 +204,8 @@ def __init__(
     k_norm=k_norm,
     **kwargs,
 )
 
+impl_args["head_size" if self.use_mla else "head_dim"] = head_dim
 self.impl = impl_cls(**impl_args)
```
Copilot AI Mar 6, 2026

When use_mla is True, impl_cls is atom.model_ops.attention_mla.MLAAttention which now expects head_size (not head_dim). This code always includes head_dim in impl_args and then also adds head_size, so MLAAttention will receive an unexpected head_dim kwarg and raise at construction time. Build impl_args conditionally (only pass head_dim for MHA, and only pass head_size for MLA), or remove the unconditional head_dim entry before instantiating the MLA impl.

7 participants